Chapter · Scaling & Performance — Part II

Infrastructure,
Architecture & Beyond

Part I built the vocabulary — latency, percentiles, throughput, utilization — and covered database-level optimizations. This second part zooms out to infrastructure: how statelessness enables horizontal scaling, how load balancers distribute traffic, how databases themselves scale with replicas and shards, how CDNs defeat physics, how async processing slashes perceived latency, and how to choose between monoliths, microservices, and serverless. We close with the mental models that tie everything together.


01

Statelessness — The Key Enabler

Horizontal scaling only works if any server instance can handle any request. The moment one server holds exclusive data — a session object in memory, an uploaded file on its local disk, an in-process cache — that server becomes special. Lose it, and you lose that data. Route a user elsewhere, and they get errors. This is statefulness, and it breaks horizontal scaling.

Statelessness is the principle that no single server instance holds any information that other instances cannot access. Every piece of persistent state — sessions, files, cache entries, database records — must live in an external, shared store that all instances can reach.

✗ STATEFUL (BROKEN) Server A session in RAM Server B no session! → 401 ✗ ✓ STATELESS (CORRECT) Server A Server B Server C Redis (sessions) S3 (files) Postgres (data) All shared. Any instance can serve any request.
Fig 1 · Stateful servers break under horizontal scaling; stateless ones don't

What must be externalized

State TypeWrong (in-process)Right (external)
SessionsIn-memory dictionaryRedis / database / JWT tokens
File uploadsLocal filesystemS3, Cloudflare R2, MinIO
CacheIn-memory mapRedis, Memcached, Valkey
DatabaseSQLite file on diskPostgres (RDS), MySQL, centralized SQLite
Background jobsIn-process goroutine/threadRedis queue, RabbitMQ, Kafka

Stateless session management in Go

Go
import (
    "context"
    "github.com/redis/go-redis/v9"
    "time"
)

var rdb = redis.NewClient(&redis.Options{Addr: "redis:6379"})

// StoreSession saves session data in Redis so ANY server can read it.
func StoreSession(ctx context.Context, sessionID string, userID string) error {
    return rdb.Set(ctx, "session:"+sessionID, userID, 24*time.Hour).Err()
}

// GetSession retrieves session data — works regardless of which server
// the request lands on, because Redis is shared across all instances.
func GetSession(ctx context.Context, sessionID string) (string, error) {
    return rdb.Get(ctx, "session:"+sessionID).Result()
}

Stateless session in Python (FastAPI + Redis)

Python
import redis, uuid
from fastapi import FastAPI, Response, Cookie

app = FastAPI()
r = redis.Redis(host="redis", decode_responses=True)

@app.post("/login")
def login(response: Response, email: str, password: str):
    user = authenticate(email, password)
    session_id = str(uuid.uuid4())

    # Store in Redis — accessible by ALL server instances
    r.setex(f"session:{session_id}", 86400, user.id)
    response.set_cookie("session_id", session_id, httponly=True)
    return {"status": "ok"}
Thumb rule

If you can't delete a server instance at random and have your system continue working identically, your architecture is stateful and will break under horizontal scaling. Audit every os.MkdirTemp, every in-memory map, every local file write.


02

Load Balancer Algorithms

With multiple server instances, you need a mechanism to decide which instance handles each incoming request. The load balancer sits between the internet and your servers, receives all traffic, and forwards each request based on a chosen algorithm.

Round Robin

The simplest algorithm: requests are distributed in rotating order — server A, B, C, A, B, C. It works well when all servers have equal capacity and all requests have roughly equal cost. It fails when request costs vary widely: expensive requests (external API calls, heavy DB writes) can all pile up on one server by chance.

Weighted Round Robin

A variation where servers with more capacity receive proportionally more requests. If server A has 8 GB RAM and servers B/C have 4 GB each, A gets twice the traffic. Still blind to request cost.

Least Connections

A smarter algorithm: the load balancer tracks the number of active connections on each server and routes each new request to the server with the fewest. Since expensive operations keep connections open longer, servers handling heavy requests naturally receive fewer new ones. This self-balances across heterogeneous workloads.

LOAD BALANCER least_conn Server A 47 active conns ⚠ 82% ↓ fewer new Server B 12 active conns 36% ← next request Server C 14 active conns 41%
Fig 2 · Least connections routes away from the busy server automatically

Other algorithms

Least Response Time favors servers returning responses fastest — servers already struggling get fewer requests. Resource-Based monitors actual CPU/RAM usage on each server and routes accordingly. IP Hash deterministically maps each client IP to a specific server (useful for session affinity, though statelessness is the better solution).


03

Health Checks

What happens when a server crashes? Without health checks, a round-robin load balancer keeps sending requests to the dead server — every third user gets a 502 error. Health checks solve this with a simple mechanism: the load balancer periodically sends a lightweight test request (typically GET /health) to every backend. Servers that return 200 stay in the rotation; servers that fail get blacklisted until they recover.

LB health check every 1s Server A ✓ 200 Server B ✗ 502 → BLACKLISTED no user traffic sent Server C ✓ 200
Fig 3 · Health checks — dead servers get blacklisted; pings continue until recovery

Health endpoint in Go

Go
// A production health check verifies downstream dependencies, not just "I'm alive."
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    checks := map[string]string{}

    if err := db.PingContext(r.Context()); err != nil {
        checks["database"] = "unreachable"
    } else {
        checks["database"] = "ok"
    }

    if err := rdb.Ping(r.Context()).Err(); err != nil {
        checks["redis"] = "unreachable"
    } else {
        checks["redis"] = "ok"
    }

    for _, v := range checks {
        if v != "ok" {
            w.WriteHeader(http.StatusServiceUnavailable)
            json.NewEncoder(w).Encode(checks)
            return
        }
    }
    json.NewEncoder(w).Encode(checks)
})

04

Read Replicas

Scaling your application servers is straightforward once you externalize state. Scaling the database — the stateful heart of your architecture — is harder. The first and most common approach is read replicas.

The idea: maintain one primary (master) database that handles all write operations (INSERT, UPDATE, DELETE), and one or more replica databases that hold copies of the same data and serve only read operations (SELECT). Since 70–90% of typical SaaS traffic is reads, replicas absorb the majority of the load.

Backend servers WRITES (30%) READS (70%) Primary (US) INSERT/UPDATE/DELETE replication Replica (India) Replica (Japan) Replica (EU) Benefits: › Primary load drops to ~30% › Users get regional low latency › Read capacity scales linearly
Fig 4 · Read replicas offload read traffic and reduce geographic latency

The replication lag problem

Data written to the primary must propagate to replicas, and propagation takes time — the replication lag. If a user updates their name on the primary and immediately refreshes (a read that hits a replica), they may see the old name because the replica hasn't caught up yet.

Solutions include: read-after-write routing — route reads that follow a recent write to the primary instead of a replica; lag-aware delays — delay the read until replication is confirmed; or client-side optimistic updates — show the new value locally before the server confirms.


05

Sharding & Partitioning

Sharding (also called partitioning) means splitting a single large table across multiple physical database instances based on a shard key. Instead of one orders table with 10 billion rows, you have 12 shards — one per month — each holding a fraction of the data.

Sharding solves two problems simultaneously: query latency drops because each shard contains fewer rows to scan, and throughput increases because multiple database instances can serve requests in parallel.

orders 10B rows slow queries single instance bottleneck Shard 1: Jan–Mar 2.5B rows Shard 2: Apr–Jun 2.5B rows Shard 3: Jul–Sep 2.5B rows Shard 4: Oct–Dec 2.5B rows 4× fewer rows per query 4× throughput (parallel shards)
Fig 5 · Sharding divides one massive table across multiple database instances

The hardest part of sharding is choosing the shard key: the column or criteria that determines which shard holds each row. Common strategies include time-based (order date), geographic (user region), or hash-based (hash of user ID modulo shard count). A poorly chosen shard key creates "hot shards" — imbalanced partitions where one shard receives disproportionate traffic.


06

Distributed Databases

Modern distributed databases handle replication, sharding, distributed transactions, and geographic distribution automatically. Instead of manually configuring replicas and shard keys, you sign up for a service and get a connection URL.

DatabaseBase EngineKey Feature
PlanetScaleMySQL (Vitess)Horizontal sharding with online migrations
NeonPostgres (Rust)Serverless Postgres with branching
CockroachDBPostgres-compatibleMulti-region consistency, survives zone failures
YugabyteDBPostgres-compatibleGeo-distributed with tunable consistency
Practical advice

Unless you have deep database administration expertise, do not roll your own replication or sharding infrastructure. Use a managed provider. Your job is to understand these concepts well enough to configure the provider's dashboard — choosing replica regions, backup frequency, and shard strategies — not to implement the algorithms yourself.


07

CDNs & the Speed of Light

Light travels through fiber optic cables at roughly 200,000 km/s. A round trip from Tokyo to Virginia (US-East-1) — about 20,000 km — takes a minimum of 100ms. This is a physics floor that no amount of code optimization can reduce. And 100ms is just the network transit — add server processing, database queries, and serialization, and the total easily reaches 500–800ms.

Content Delivery Networks (CDNs) solve this by placing cached content on edge nodes (also called Points of Presence or PoPs) physically close to users. Instead of a 20,000 km round trip, the request travels 100–200 km, reducing network latency from 100ms to 2–3ms.

Tokyo → Virginia: ~100ms   |   Tokyo → Local CDN: ~2ms

What to cache on a CDN

Static content is the obvious starting point: JavaScript bundles, CSS, HTML files, images, fonts, and videos. These change only on deployment, making them perfect for long-lived caching. API responses can also be cached when the data is relatively stable — product catalogs, blog posts, public configuration. CDN providers like Cloudflare support tag-based cache purging: when data changes, you invalidate only the affected entries.

CDNs as a security layer

CDNs also absorb DDoS attacks. An attacker flooding your origin server with traffic from thousands of bots can crash your server or rack up massive autoscaling bills. With a CDN like Cloudflare in front, the traffic is distributed across their global network — which is large enough to absorb even terabytes-per-second attacks — while automated detection triggers CAPTCHAs and blocks suspicious IPs before traffic reaches your origin.


08

Edge Computing

Traditional CDNs serve static files — no processing, just "find file, send file." Edge computing adds a processing layer at the CDN node: your code runs at the edge, close to the user, before (or instead of) reaching your origin server.

Use cases

Authentication at the edge: Instead of a 100ms round trip to your origin just to verify a session and return 401, the edge node validates the session in 2–3ms. Your origin only receives legitimate, authenticated requests — reducing both latency and load. Localization: Edge nodes detect the user's region and language, serving a Japanese version of your site without a round trip to the origin. Request routing: Intelligent routing to the nearest backend based on geography.

Constraints

Edge computing environments have limited resources — typically 1 GB RAM, 1 CPU core, no filesystem access, no raw TCP sockets. Cloudflare Workers use V8 isolates (Chrome's JavaScript engine) which boot in under 1ms but can only run JavaScript/TypeScript. These constraints mean edge computing complements your origin servers — it can't replace them for complex business logic, database-heavy operations, or long-running tasks.

Cloudflare Worker
// Edge authentication — runs in 2–3ms at the closest CDN node
export default {
  async fetch(request, env) {
    const sessionId = getCookie(request, "session_id");

    if (!sessionId) {
      return new Response("Unauthorized", { status: 401 });
    }

    // Check session in edge KV store (global, low-latency)
    const userId = await env.SESSIONS.get(sessionId);
    if (!userId) {
      return new Response("Unauthorized", { status: 401 });
    }

    // Authenticated → forward to origin server
    request.headers.set("X-User-Id", userId);
    return fetch(request);
  },
};

09

Asynchronous Processing & Queues

Not every operation requires the user to wait for its completion. Asynchronous processing identifies tasks that can happen in the background and offloads them to a message queue — dramatically reducing the latency the user perceives.

The pattern

Consider inviting a team member: you validate the request, save the invitation to the database (100ms), and then need to send an email via an external provider (300ms). The user doesn't need to wait for the email to send — they just need confirmation that the invite was recorded. So you return 200 after the database write, and push "send email" as a job to a queue. A separate worker process picks up the job and handles the email asynchronously.

Synchronous: 100ms (DB) + 300ms (email) = 400ms perceived
Async: 100ms (DB) + push to queue = 100ms perceived

What to offload

Emails & notifications — users don't expect instant delivery. Image/video processing — encoding, resizing, thumbnail generation. Account deletion — cascading deletes across many tables. Report generation — aggregating data across millions of rows. Webhook delivery — retrying on failure without blocking the main request. The rule: if the user doesn't need the result in the same HTTP response, it's a candidate for async processing.

Background job with Go and Redis

Go
import (
    "context"
    "encoding/json"
    "github.com/redis/go-redis/v9"
)

type EmailJob struct {
    To       string `json:"to"`
    Subject  string `json:"subject"`
    Template string `json:"template"`
}

// Producer: push job to queue (called from your API handler)
func EnqueueEmail(ctx context.Context, job EmailJob) error {
    data, _ := json.Marshal(job)
    return rdb.LPush(ctx, "queue:emails", data).Err()
}

// Consumer: runs as a separate process (or goroutine)
func EmailWorker(ctx context.Context) {
    for {
        result, err := rdb.BRPop(ctx, 0, "queue:emails").Result()
        if err != nil { continue }

        var job EmailJob
        json.Unmarshal([]byte(result[1]), &job)
        sendEmail(job)  // takes 300ms — but no user is waiting
    }
}

Background job with Python (Celery)

Python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)
def send_invitation_email(self, email: str, invite_url: str):
    try:
        email_provider.send(
            to=email,
            subject="You've been invited!",
            template="invite",
            context={"url": invite_url},
        )
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

# In your API handler — returns instantly to the user
send_invitation_email.delay("user@example.com", "https://app.co/invite/abc")
📖 Reference: Celery Documentation · BullMQ (Node.js) — production-grade task queues for Python and JavaScript respectively.

10

Monoliths vs Microservices

A monolith is a single deployable unit: authentication, order processing, notifications, payments — all in one codebase, one process, one deployment. Monoliths are simple to develop, test, deploy, and refactor. You can horizontally scale a monolith by running multiple instances behind a load balancer.

Microservices split the application into independent services, each with its own codebase, deployment, and often its own database. They are primarily about scaling teams, not machines.

When monoliths break down

Deployment coupling: With 200 developers on one codebase, one team's unfinished work can block another team's deployment. Scaling granularity: You can't scale just the payment module — you scale the entire monolith, wasting resources on modules that don't need it. Technology lock-in: Every module must use the same language and runtime, even when a different language would be dramatically better for a specific task (e.g., Rust for image processing vs. Node.js for markdown rendering).

What microservices cost

Network latency: What was a function call is now an HTTP/gRPC call across the network — slower and can fail. Debugging complexity: A single user request may flow through 4+ services; debugging requires distributed tracing across all of them. Data consistency: Each service may own its own database, introducing replication lag and eventual consistency problems. Operational overhead: Each service needs its own CI/CD pipeline, monitoring, alerting, and on-call rotation.

Monolith — Choose When
  • Team < 50–100 engineers
  • Single tech stack is sufficient
  • Scaling needs are uniform
  • You want fast iteration speed
  • You're starting a new product
Microservices — Choose When
  • Team > 100+ engineers with clear boundaries
  • Different modules need different tech stacks
  • Individual modules need independent scaling
  • Deployment velocity is blocked by coupling
  • You have deep distributed systems expertise
The cardinal rule

Microservices are not a performance optimization. They are an organizational architecture for large teams. If your team is small, microservices add massive complexity for zero benefit. A horizontally scaled monolith handles millions of users — most companies never outgrow one.


11

Serverless Computing

Traditional servers are always on — you provision a machine (e.g., 4 GB RAM, 2 cores) and pay for it 24/7, whether it's handling 500 RPS or zero. You must predict capacity: under-provision and users get errors during traffic spikes; over-provision and you waste money.

Serverless flips this model: you provide code (functions), and the platform provisions a machine only when a request arrives. After the response is sent, the machine may be recycled. You pay only for the milliseconds of actual CPU execution — not for idle time.

How it works

An API Gateway receives HTTP requests and routes each one to a serverless function. On the first request, the platform boots a runtime, loads your code, processes the request, and returns a response. The instance may stay warm for a few seconds (handling subsequent requests without cold start) or be destroyed. Scaling is automatic and theoretically unbounded — the platform creates as many instances as needed.

The cold start problem

Spinning up a new execution environment — booting the OS/VM, loading the language runtime, loading your code — takes time. This is the cold start, and it's the primary disadvantage of serverless. Cold starts vary dramatically by technology:

PlatformTechnologyCold Start
Cloudflare WorkersV8 Isolates (JavaScript)~0–5ms
AWS Lambda (Node.js)Firecracker microVM~100–300ms
AWS Lambda (Java)Firecracker + JVM~1–3 seconds
AWS Lambda (Python)Firecracker + CPython~200–500ms

When to use serverless

Good Fit
  • Event-driven tasks (file upload processing, queue consumers)
  • Sporadic, unpredictable traffic
  • Image/video processing pipelines
  • Scheduled jobs (cron-like tasks)
  • Edge authentication & routing
Bad Fit
  • Latency-sensitive APIs (banking, payments)
  • Long-running processes (>15 min on Lambda)
  • WebSocket-heavy applications
  • Apps requiring many persistent DB connections
  • Steady, predictable high-volume traffic
The honest take

Serverless is powerful for specific use cases but is not a universal replacement for servers. The industry is somewhat over-hyped. Traditional servers with autoscaling remain the right choice for most SaaS applications with steady traffic. Serverless shines for event-driven, sporadic workloads where you'd otherwise pay for idle compute.


Scaling Mental Models

After covering the full landscape — from latency percentiles through database sharding, CDNs, async processing, microservices, and serverless — here are the mental models that tie everything together:

1. Always start with the problem

Every technique in these notes is a solution. Before reaching for solutions, measure your system using observability (logs, metrics, traces). Know specifically which component is slow, why it's slow, and how slow it is. Solving the wrong problem is worse than doing nothing — it gives you false confidence while the real bottleneck festers.

2. Prefer simple solutions

A vertically scaled monolith is simpler than a horizontally scaled microservice mesh. Proper database indexing is simpler than adding a Redis cache. Simple solutions are easier to understand, debug, and operate. Complexity is a cost — every component you add is another component that can fail, that must be monitored, that must be understood. Accept complexity only when simplicity is genuinely insufficient.

3. Scale for the problems you have

You don't need to build for a million users on day one — most platforms never reach a million users. Build for your current scale with reasonable headroom, and let observability tell you when and where to invest in scaling. Your application has unique characteristics that generic advice from Netflix engineering blogs may not apply to. Learn from your system's behavior.

4. Measure from day one

Observability is the one exception to "start simple." Production-grade logging, metrics, and tracing should be in place from the first deployment. They pay for themselves immediately — you never have to guess where a bug is, you never get surprised by a traffic spike, and you always have the data to make informed scaling decisions.

Go
// A minimal but real observability setup: Prometheus metrics + structured logging
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "log/slog"
)

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Duration of HTTP requests",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path", "status"},
)

func init() {
    prometheus.MustRegister(requestDuration)
}

// Expose /metrics for Prometheus to scrape
// Visualize in Grafana: p50, p90, p99 latency per endpoint
http.Handle("/metrics", promhttp.Handler())

5. Performance is a mindset

Scaling is not a one-time project. It's an ongoing practice — you build systems, watch them struggle under real traffic, optimize what matters, and learn what doesn't. Over time, through trial and error, you develop an intuition for where bottlenecks hide and which solutions actually move the needle. Your job isn't to predict every failure — it's to build systems that handle failures gracefully and to develop the diagnostic skills to resolve them quickly when they occur.

The Scaling Playbook — In Order

Step 1: Measure everything (observability from day one).
Step 2: Fix the database (indexes, N+1, connection pooling).
Step 3: Add caching (Redis for hot paths).
Step 4: Offload work (async processing for emails, uploads, deletes).
Step 5: Use CDNs (static content + API response caching).
Step 6: Vertical scaling (bigger machine — no code changes).
Step 7: Horizontal scaling (stateless servers + load balancer).
Step 8: Database scaling (read replicas, then sharding if needed).
Step 9: Only then consider microservices, serverless, or edge computing.

↑ Back to top